Typos in Czech Corpora

Author

Marek Grác

Abstract

The growing use of written corpora, not only for manual querying but also for machine learning, has led to the creation of massive corpora. These corpora are almost exclusively crawled from the internet and contain texts of varying quality. Corpora that contain more typos or ungrammatical text are more difficult for computational linguists to use and are thus a major obstacle in automatic development. In this paper we attempt to assess the quality of some existing Czech corpora using a manually created wordlist. We show that building such a list of frequent typos can be done without major investment when agile techniques are used.
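As a rough illustration of the idea in the abstract, the sketch below scores a text by the share of its tokens that appear in a list of known typos. It is a minimal sketch under stated assumptions, not the author's method: the function name `typo_rate`, the regex tokenizer, and the two sample entries in `KNOWN_TYPOS` (common Czech misspellings picked for illustration) do not come from the paper.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical sample wordlist: misspelling -> intended form.
# These entries are illustrative and are not taken from the paper.
KNOWN_TYPOS = {
    "vyjímka": "výjimka",
    "standartní": "standardní",
}

# Simple tokenizer (an assumption): runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def typo_rate(text, typo_list):
    """Return (share of tokens that are listed typos, per-typo counts)."""
    # Normalize to NFC so accented letters compare equal regardless of encoding form.
    known = {unicodedata.normalize("NFC", t).lower() for t in typo_list}
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = TOKEN_RE.findall(normalized)
    hits = Counter(t for t in tokens if t in known)
    rate = sum(hits.values()) / len(tokens) if tokens else 0.0
    return rate, hits


if __name__ == "__main__":
    sample = "To je vyjímka a standartní postup a výjimka"
    rate, hits = typo_rate(sample, KNOWN_TYPOS)
    print(f"typo rate: {rate:.2%}", dict(hits))
```

Running the same function over samples from several corpora would give comparable rates, provided the tokenization is the same for each corpus.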

Similar resources

Recent Czech Web Corpora

This article introduces the largest Czech text corpus for language research – czTenTen12 with 5.4 billion tokens. A brief comparison with other recent Czech corpora follows.

Czech-Slovak Parallel Corpora for MT between Closely Related Languages

The paper describes suitable sources for creating Czech-Slovak parallel corpora, including our procedure of creating plain text parallel corpora from various data sources. We attempt to address the pros and cons of various types of data sources, especially when they are used in machine translation. Some results of machine translation from Czech to Slovak based on the acquired corpora are also g...

Neural Networks for Sentiment Analysis in Czech

This paper presents the first attempt at using neural networks for sentiment analysis in Czech. Neural networks have shown very good results on sentiment analysis in English, thus we adapt them to the Czech environment. We first perform experiments on two English corpora to allow comparability with the existing state-of-the-art methods for sentiment analysis in English. Then we explore the e...

The SYN-series corpora of written Czech

The paper overviews the SYN series of synchronic corpora of written Czech compiled within the framework of the Czech National Corpus project. It describes their design and processing with a focus on the annotation, i.e. lemmatization and morphological tagging. The paper also introduces SYN2013PUB, a new 935-million newspaper corpus of Czech published in 2013 as the most recent addition to the S...

Oral2008: New Balanced Corpus of Spoken Czech

Attention paid to spoken language has increased in the last decades, as well as its importance for linguistic research and natural language processing in general. However, compilation of spoken corpora as an indispensable source of data is very laborious and thus expensive. Nevertheless, more and more spoken corpora are being created currently. There are various approaches to their design, dept...

Publication date: 2013
